MCIC Wooster, OSU
2024-02-01
https://en.wikipedia.org/wiki/Protein_isoform
The transcriptome is the full set of transcripts expressed by an organism, which:
Is not at all stable across time & space in any given organism,
unlike the genome but much like the proteome
Varies both qualitatively (which transcripts are expressed) but especially quantitatively (how much of each transcript is expressed)
Transcriptomics is the study of the transcriptome, i.e. the large-scale study of RNA transcripts expressed in an organism.
There are many approaches & applications — but most commonly, transcriptomics focuses on:
mRNA rather than on noncoding RNA types such as rRNA, tRNA, and miRNA
Quantifying gene expression levels (cf. most genomics approaches)
Statistically comparing expression levels between groups (treatments, biotypes, tissues, etc.)
[TODO: Add count comparison figure]
Considering…
That protein production gives clues about the activity of specific biological functions, and the molecular mechanisms underlying those functions;
That it is much easier to measure protein expression than transcript expression at large scales;
The central dogma
… we can use gene expression levels as a proxy for protein expression levels and make functional inferences.
Specifically, we can use transcriptomics to:
Find the pathways and genes that:
Underlie phenotypic responses
Explain differences between groups (treatments, genotypes, sexes, tissues, etc.)
Can be targeted to enhance or reduce responses for pathogen and pest control
RNA-seq is the current state-of-the-art family of methods to study the transcriptome.
It involves the random (“shotgun”) sequencing of millions of transcript fragments per sample.
We will focus on the most common type of RNA-seq, which:
Does not actually sequence the RNA, but first reverse transcribes RNA to cDNA
Attempts to sequence only mRNA while avoiding noncoding RNAs (“mRNA-seq”)
Does not distinguish between RNA from different cell types (“bulk RNA-seq”)
Uses short reads (up to 150 bp) that do not cover full transcripts but do uniquely identify genes
Which type of RNA-seq?
When people say “RNA-seq” without further specifications, you can assume they’re talking about this type.
RNA-seq data can also be used for applications other than expression quantification, such as to:
For organisms without a reference genome: characterize the transcriptome, i.e. identify genes present in the organism
For organisms with a reference genome: discover new genes & transcripts, and improve genome annotation
All in all, RNA-seq is a very widely used technique — constitutes the most common usage of high-throughput sequencing!
RNA-seq is also the most common data type I assist with as a bioinformatician at the MCIC. Some projects I’ve been involved in used RNA-seq to identify genes & pathways that differ between:
Multiple soybean cultivars in response to Phytophtora sojae inoculation; soybean in response to different Phytophtora species and strains (Dorrance lab, PlantPath)
Wheat exposed to Xanthomonas with a gene knock-out vs. knock-in (Jacobs lab, PlantPath)
Mated and unmated mosquitos (Sirot lab, College of Wooster)
Tissues of the ambrosia beetle and its symbiotic fungus (Ranger lab, USDA Wooster)
Diapause-inducing conditions for two pest stink bug species (Michel lab, Entomology)
Human carcinoma cell lines with vs. without expression manipulation of a gene (Cruz lab, CCC)
Pig coronaviruses with vs. without an experimental insertion (Wang lab, CFAH)
As well as to improve the genome annotation of a plant-parasitic nematode (Taylor lab, PlantPath).
[TODO: 1 or 2 figs, e.g. from Tracy]
RNA-seq is not the only method to quantify gene expression, but often the preferred one:
Microarrays — hybridization of cDNA to preconstructed probes
Microarrays were the transcriptomics method of choice before the advent of high-throughput sequencing, but disadvantages relative to RNA-seq include:
Produce relative measurements only with a much smaller dynamic range than RNA-seq
Can only capture known features: cannot discover or examine novel features, and do not work for species without a (close) reference genome
Brown & Botstein 1999 (www.nature.com/articles/ng0199supp_33)
Small RNA-seq for small noncoding RNAs such as microRNAs and Piwi-interacting RNAs
Single-cell RNA-seq (scRNA-seq) — commonly used in cancer research
Spatial RNA-seq – TODO add to this
Direct RNA sequencing
The rest of this lecture and tomorrow’s lab will only focus on bulk RNA-seq of mRNA-derived cDNA.
Berge et al. 2019 (www.annualreviews.org/doi/10.1146/annurev-biodatasci-072018-021255)
RNA-seq typically compares groups of samples defined by differences in:
Treatments (e.g. different host plant, temperature, diet, mated/unmated) and/or
Organismal variants: ages/developmental stages, sexes, or genotypes (lines/biotypes/subspecies/morphs) and/or
Tissues
To be able to make statistically supported conclusions about expression differences between such groups of samples, we must have biological replication.
Then, a so-called differential expression analysis compares sample groups and estimates a P-value for every single expressed (measured) gene.
When designing an RNA-seq experiment, keep the following in mind:
In agricultural research, it is common to study both a plant host and its adversary.
If the adversary is a microbial/viral/fungal/oomycete pathogen, it is possible or necessary to collect samples that contain both host and pathogen.
When these samples are subjected to RNA-seq, they will contain both host and pathogen reads.
These reads can be separated bioinformatically and when both host and pathogen reads are analyzed, this is called a dual RNA-seq experiment.
Berge et al. 2019 (www.annualreviews.org/doi/10.1146/annurev-biodatasci-072018-021255)
Kukurba & Montgomery 2015 (www.ncbi.nlm.nih.gov/pmc/articles/PMC4863231/)
The first step is to extract RNA from tissue samples.
This is typically the only step you conduct yourself in the lab, with sequencing facilities taking care of downstream procedures.
Some considerations:
The tissue sample should be collected from an organism that is alive or recently deceased (matter of minutes), and the RNA should be extracted immediately.
With the appropriate buffer and temperate, an RNA extraction can be stored long-term.
Highly recommended to include a DNase treatment step to avoid DNA contamination.
Important to perform Quality Control (QC) on your extractions: quantity (e.g. with a Nanodrop or Qubit) and quality (degradation; e.g. with a gel).
“Library preparation” is the series of lab steps to produce, from one or more RNA extraction, a collection of molecules ready to be sequenced: a sequencing “library”. (A sequencing facility typically does this for you.)
Let’s consider different options you have for library prep:
mRNA enrichment
Needed because mRNA makes up only a few percent of RNA molecules, and done using either:
PolyA-selection: specifically selects mature (spliced) mRNAs
Ribodepletion: fishes out rRNA — this is suboptimal for mRNA-seq (presence of other RNAs and of pre-mRNAs) but can be a last resort with poor RNA quality
Library strandedness
Libraries can be “stranded” or “unstranded” —stranded libraries allow you to tell the directionality of a transcript, which in turn allows you to:
Distinguish between overlapping genes
Assess whether you may have DNA contamination
Sequencing technology
Illumina short reads are by far the most common
PacBio or ONT long reads are an alternative if sequencing full transcripts (isoforms) is key
Single-end vs. paired-end reads (for Illumina)
Sequencing “depth” — how many reads per sample
Guidelines highly approximate (cf. in genomics) — depends not just on transcriptome size, but also on expression level distribution, expression levels of genes of interest, etc.
Typical recommendations are 20+ million reads per sample [CHECK]
For statistical power, more replicates are better than a higher sequencing depth
TODO
Modified after Kukurba & Montgomery 2015
You will typically receive a “demultiplexed” (split by sample) set of FASTQ files.
Most commonly, sequencing for RNA-seq experiments is done on a single lane, so you will receive one (single-end) or two (paired-end) files per sample.
These are pretty large even in compressed form (1 or a few Gb each).
[ADD SCREENSHOT]
Once you receive your data, the first series of analysis steps involves going from the raw reads to a count table (the count table will have a read count for each gene in each sample).
This part is bioinformatics-heavy with large files, a need for lots of computing power such as with a supercomputer, command-line (Unix shell) programs — it specifically involves:
Read preprocessing: QC, trimming, and optionally rRNA removal
Aligning reads to a reference genome (+ alignment QC)
Quantifying expression levels (per-gene & per-sample to create a count table)
This can be run using standardized, one-size-fits-all workflows, and is therefore (relatively) suitable to be outsourced to a company, facility, or collaborator.
Read pre-processing includes the following steps:
Checking the quantity and quality of your reads
Does not change your data, but helps decide next pre-processing steps / sample exclusion
Also useful to check for contamination, library complexity, and adapter content
Typically done with the tool FastQC (and MultiQC for across-sample summaries)
TrimGalore or Trimmomatic.SortMeRNA (optional).Kraken2 (optional).Transcript-level vs. gene-level
The terminology “transcript-level” vs. “gene-level”, e.g. in “transcript-level counts” refers to the distinction between having separate counts for each isoform (AKA transcript) versus a single count for each gene. Gene-level counts may either have been aggregated across isoforms, or reads were never assigned to isoforms in the first place.
The alignment of reads to a reference genome needs to be “splice-aware”.
Alternatively, you can align to the transcriptome (i.e., all mature transcripts).
Berge et al. 2019 (www.annualreviews.org/doi/10.1146/annurev-biodatasci-072018-021255)
The alignment of reads to a reference genome needs to be “splice-aware”.
Alternatively, you can align to the transcriptome (i.e., all mature transcripts).
Alignment usually consists of two steps:
One-time creation of a tool-specific “index” of the reference genome/transcriptome.
For each sample independently, alignment of the reads to the reference index.
Alignment tools
Some of the currently most commonly used alignment tools are HISAT2 & STAR (map to genome), and RSEM & Salmon & Kallisto (map to transcriptome).
Without a reference genome, an RNA-seq project is considerably more work.
These two extra steps are needed:
Trinity. (This is quite compute-intensive!)Transcriptome annotation
Then, your transcriptome will need to be functionally annotated:
This means you try to assign gene names and functional categories to your transcripts
A very similar process to genome annotation, and works in part by using BLAST or related tools to find orthologous, already annotated, genes in other organisms.
Then, you can align to your newly assembled transcriptome the same way you would align to a reference transcriptome (obviously, you won’t have the option to align to the genome).
These kind of stats can only be assessed after aligning to the genome (vs. to the transcriptome)1.
Alignment rates
What percentage of reads was successfully aligned?
Low rates (<80%) => cross-species contamination or poor reference genome quality
Multi-mapped reads (mapping to multiple locations in the genome) are typically accepted by the mapper but are not or probabilistically quantified — more on this later.
Alignment targets
What percentages of aligned reads mapped to exons vs. introns vs. intergenic regions?
High intronic mapping rates => presence of pre-mRNA
High intergenic mapping rates => DNA contamination or poor genome assembly/annotation quality
Alignment QC tools
You can get these stats by checking the aligner’s “log” outputs, and by running tools like QualiMap and RSeQC.
At heart, a simple counting exercise once you have the alignments in hand.
But made more complicated by sequencing biases and multi-mapping reads.
Several metrics of gene expression levels exist:
Adjusted by gene length and library size
E.g., FPKM (superseded) and TPM (Transcripts Per Million)
Raw counts
These are used by most downstream approaches (more later).
Current best-performing tools (Salmon, Kallisto) do transcript-level quantification — even though this is typically followed by gene-level aggregation prior to downstream analysis.
Fast-moving field
Several very commonly used tools like FeatureCounts (>15k citations) and HTSeq (<18k citation) have become disfavored in the past couple of years, as they e.g. don’t count multi-mapping reads at all.
Nf-core RNAseq workflow cc
The second part of RNA-seq data analysis involves analyzing the count table.
In contrast to the first part, this can be done on a laptop and instead is heavier on statistics, data visualization and biological interpretation.
DIY!
This part also more custom and iterative and I do not recommend to have this done by a third party.
Typically done using the specialized RNA-seq “packages” in the R language. Common steps involve:
PCA
Assessing overall sample clustering patterns with a Principal Component Analysis (PCA)
Differential expression analysis
Finding genes that differ in expression level between sample groups (DEGs)
Functional enrichment analysis
See whether certain gene functions are overrepresented among DEGs
Two or more groups are typically compared, such as plants exposed to virulent vs. avirulent aphids.
[PCA example fig]
More than two groups
Plants exposed to:
Pairwise comparisons:
More than one independent variable
Statistical designs other than pairwise comparisons are also possible. For example:
Controlling for an independent variable (fixed or random effect)
Assessing the effect of one variable while controlling for another. E.g.:
Which genes respond similarly to a virulent (vs. avirulent) pathogen in resistant and susceptible plants?
Which genes respond similarly to a virulent (vs. avirulent) pathogen regardless of field plot?
Gene count normalization
To be able to fairly compare samples, raw counts need to be adjusted:
What about gene lengths?
Even though longer genes are more likely to be samples and gene counts are therefore confounded by gene length, there is no normalization by gene length, because genes are not compared.
Probability distribution of the count data
Multiple-testing correction
10,000+ genes are independently tested during a DE analysis, so there is a dire need for multiple testing correction.
The standard method is not the very strict Bonferroni correction but the Benjamini-Hochberg (BH) method.
Log2-fold changes (logFC/LFC) as a measure of expression difference
Specialized R/Bioconductor packages like DESeq2 and EdgeR make differential expression analysis relatively straightforward and automatically take care of the abovementioned considerations (we will use DESeq2 in the lab).
Lists of DEGs can be quite long, and it is not always easy to make biological sense of them. Functional enrichment analyses help with this.
They check whether certain functional categories of genes (e.g., biological processes, or pathways) are statistically overrepresented among up- and/or downregulated genes.
There are a number of databases that group genes into functional categories, but the two main ones used for enrichment analysis are:
Gene Ontology (GO)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Genome annotations for a specific organism often but not always include GO or KEGG annotations.
KEGG annotations are more commonly missing but can also be more easily generated, by uploading a FASTA file with amino acid sequences to KEGG’s GhostKOALA webservice: https://www.kegg.jp/ghostkoala/.
Genes are assigned zero, one or more GO “terms”
Hierarchical structure with more specific terms grouping into more general terms
Consists of three “ontologies”:
Biological Process
Molecular Function
Cellular Component
Orthologs, modules and pathways
Contains fewer genes than GO
[Include pathway figure]
The two most common statistical approaches are:
| DE | not-DE | |
|---|---|---|
| In photosynthesis category | 15 | 30 |
| Not in photosynthesis category | 85 | 14,870 |